Mortgage Probability of Default & Mortgage Fraud¶
Data Analyst: Frankie Ma
Introduction: Mortgage fraud and default pose significant risks to both lenders and borrowers in the U.S. housing market. With the rise of online services offering fake documentation and fraudulent verification, lenders face increasing difficulty in validating borrower income and intent. These schemes—often masked as novelty tools—can lead to occupancy misrepresentation, inflated credit profiles, and fraudulent loan approvals. As lenders like Fannie Mae and Freddie Mac tighten their risk protocols, the consequences of undetected fraud ripple through financial institutions, investors, and ultimately taxpayers. At the same time, mortgage defaults continue to trigger complex legal processes, including foreclosures, which carry steep financial and reputational costs. In this project, we apply machine learning models to assess default risk and explore how data-driven strategies can help lenders better detect anomalies, protect against fraudulent applications, and reduce financial losses across varying interest rate scenarios.
from IPython import display
display.Image("Mortgage.png")
[This image was generated by AI]
Project Goal: Our project aims to develop predictive models that identify high-risk mortgage loan applications by detecting potential defaults and fraudulent behavior. By leveraging machine learning and profit-based evaluation across different interest rate scenarios, we seek to help lenders make more informed, data-driven decisions that reduce financial exposure and enhance loan portfolio quality.
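The "profit-based evaluation across different interest rate scenarios" can be made concrete with a simple expected-profit rule. The sketch below is an illustrative assumption, not this project's implemented logic: a lender approves a loan only when expected interest income exceeds the expected loss, given the model's predicted default probability, an assumed interest rate, and an assumed loss-given-default (`lgd`).

```python
import numpy as np

def expected_profit(p_default, principal, rate, lgd=0.5):
    """Illustrative per-loan expected profit (all parameters are assumptions):
    interest earned if the loan performs, minus the expected loss on default."""
    return (1 - p_default) * rate * principal - p_default * lgd * principal

# Approve when expected profit is positive; compare decisions across rate scenarios.
p = np.array([0.05, 0.20, 0.40])
for rate in (0.03, 0.06, 0.09):
    profits = expected_profit(p, principal=100_000, rate=rate)
    print(f"rate={rate:.0%}: approve {list(profits > 0)}")
```

Under this rule the break-even default probability is `rate / (rate + lgd)` (about 0.107 at a 6% rate with `lgd = 0.5`), so higher rates tolerate riskier applicants.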
Table of Contents¶
- Section 1 Anomaly Detection through Feature Engineering
- Section 2 Model Building
- Section 3 SHAP
- Section 4 Summary & Conclusions
# Before we get started, let's load all the packages we are going to use in this project.
# data
import pandas as pd
import numpy as np
# visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#import missingno as msno
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
#from wordcloud import WordCloud
from sklearn import datasets
from sklearn.tree import plot_tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# styling
sns.set_style('darkgrid')
mpl.rcParams['font.size'] = 12
mpl.rcParams['figure.facecolor'] = '#00000000'
import os
import warnings
warnings.filterwarnings("ignore")
import shap
Section 1 Anomaly Detection through Feature Engineering ¶
df = pd.read_csv('XYZloan_default_llm.csv')
df.head()
| Unnamed: 0.1 | Unnamed: 0 | AP001 | AP002 | AP003 | AP006 | AP007 | AP008 | CR004 | CR009 | ... | TD005 | TD006 | TD009 | TD010 | TD013 | TD014 | TD022 | TD024 | loan_default | reason | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 76031 | 33 | 1 | 3 | h5 | 4 | 3 | 4 | 63100 | ... | 4 | 1 | 4 | 1 | 4 | 1 | 10.0 | 0.0 | 1 | I’d really appreciate if we could move faster ... |
| 1 | 5 | 23312 | 34 | 1 | 3 | h5 | 5 | 5 | 3 | 53370 | ... | 3 | 1 | 6 | 2 | 7 | 2 | 15.0 | 10.0 | 1 | We’re trying to align the closing date with a ... |
| 2 | 9 | 66033 | 36 | 2 | 1 | ios | 2 | 2 | 3 | 5400 | ... | 4 | 2 | 4 | 2 | 5 | 2 | 25.0 | 0.0 | 1 | It would really help to close this week so I c... |
| 3 | 10 | 41847 | 28 | 1 | 1 | ios | 5 | 5 | 3 | 2000 | ... | 4 | 4 | 7 | 4 | 7 | 4 | 25.0 | 6.0 | 1 | There are some logistics around my move that m... |
| 4 | 13 | 28275 | 35 | 2 | 4 | h5 | 3 | 3 | 4 | 27704 | ... | 4 | 1 | 4 | 1 | 7 | 1 | 25.0 | 0.0 | 1 | I’d like to close by Friday if possible—the se... |
5 rows × 32 columns
data = df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0', 'reason'])
1.1 Variable Specification ¶
# specify categorical & numeric data type
cat_var = ['AP006', 'MB007']
num_var = ['AP001', 'AP002', 'AP003', 'AP007',
'AP008', 'CR004', 'CR009', 'CR015', 'CR017', 'CR018', 'CR019',
'MB005', 'PA022', 'PA023', 'PA028', 'PA029', 'PA031', 'TD001',
'TD005', 'TD006', 'TD009', 'TD010', 'TD013', 'TD014', 'TD022', 'TD024']
X_vars = cat_var + num_var
target = 'loan_default'
data[target].value_counts()
loan_default
0    12924
1     3076
Name: count, dtype: int64
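The target is imbalanced (roughly 19% defaults), which motivates the `class_weight="balanced"` setting used for the random forest later on. The balanced weights scikit-learn applies can be reproduced directly; the sketch below rebuilds the label vector from the counts shown above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild the label vector from the observed counts: 12924 non-defaults, 3076 defaults
y = np.array([0] * 12924 + [1] * 3076)

# "balanced" weight for class c = n_samples / (n_classes * count_c)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], np.round(weights, 3))))  # the minority class gets the larger weight
```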
1.1.1 Data Splitting ¶
X = data[X_vars]
y = data[target]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
1.2 Imputing Missing Values ¶
X_train_numvar = X_train[num_var]
missing_columns = X_train_numvar.columns[X_train_numvar.isnull().sum() > 0]
# Display the columns with missing values
missing_columns
Index(['MB005', 'PA022', 'PA023', 'PA028', 'PA029', 'PA031', 'TD022', 'TD024'], dtype='object')
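Before imputing, it is worth checking how much is missing, not just which columns. A minimal sketch on a toy frame (the original CSV is not reproduced here, so the values below are placeholders for the real columns):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "MB005": [1.0, np.nan, 3.0, np.nan],
    "PA022": [0.5, 0.7, np.nan, 0.9],
    "TD022": [10.0, 15.0, 25.0, 25.0],  # complete column, no NaNs
})

# Fraction missing per column, restricted to columns that have any NaNs
missing_rate = toy.isnull().mean()
missing_rate = missing_rate[missing_rate > 0].sort_values(ascending=False)
print(missing_rate)  # MB005 0.50, PA022 0.25
```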
for col in missing_columns:
    mean_value = X_train[col].mean()  # compute the mean on the training set only
    # Impute the missing values in both X_train and X_test with the training-set mean
    # (direct assignment avoids pandas' chained-assignment warning with inplace fillna)
    X_train[col] = X_train[col].fillna(mean_value)
    X_test[col] = X_test[col].fillna(mean_value)
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")
ohe = OneHotEncoder(drop="first", handle_unknown="ignore", sparse_output=False)
X_train_num = pd.DataFrame(num_imputer.fit_transform(X_train[num_var]),
columns=num_var, index=X_train.index)
X_test_num = pd.DataFrame(num_imputer.transform(X_test[num_var]),
columns=num_var, index=X_test.index)
X_train_cat_imp = pd.DataFrame(cat_imputer.fit_transform(X_train[cat_var]),
columns=cat_var, index=X_train.index)
X_test_cat_imp = pd.DataFrame(cat_imputer.transform(X_test[cat_var]),
columns=cat_var, index=X_test.index)
X_train_cat_ohe = pd.DataFrame(
ohe.fit_transform(X_train_cat_imp),
index=X_train.index,
columns=ohe.get_feature_names_out(cat_var)
)
X_test_cat_ohe = pd.DataFrame(
ohe.transform(X_test_cat_imp),
index=X_test.index,
columns=ohe.get_feature_names_out(cat_var)
)
Section 2 Model Building ¶
The one-hot-encoded categorical columns are in X_train_cat_ohe, and the mean-imputed numerical columns are in X_train_num. To prepare the data for the following analysis, we combine X_train_cat_ohe and X_train_num (and likewise for the test data).
X_train_enc = pd.concat([X_train_num, X_train_cat_ohe], axis=1)
X_test_enc = pd.concat([X_test_num, X_test_cat_ohe], axis=1)
X_train_enc = X_train_enc.replace([np.inf, -np.inf], np.nan).fillna(0.0)
X_test_enc = X_test_enc.replace([np.inf, -np.inf], np.nan).fillna(0.0)
feat_names_all = X_train_enc.columns.tolist()
2.1 Random Forest ¶
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
n_estimators=800,
max_depth=None,
min_samples_split=5,
min_samples_leaf=2,
max_features='sqrt',
bootstrap=True,
class_weight="balanced",
random_state=42,
n_jobs=-1
)
rf.fit(X_train_enc, y_train)
RandomForestClassifier(class_weight='balanced', min_samples_leaf=2,
                       min_samples_split=5, n_estimators=800, n_jobs=-1,
                       random_state=42)
2.1.1 Model Performance ¶
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, confusion_matrix
import matplotlib.pyplot as plt
# Predictions
y_pred = rf.predict(X_test_enc)
y_pred_proba = rf.predict_proba(X_test_enc)[:, 1] # Probabilities for ROC/AUC
# Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
print("Accuracy:", round(accuracy, 4))
print("Precision:", round(precision, 4))
print("Recall:", round(recall, 4))
print("F1 Score:", round(f1, 4))
print("ROC AUC:", round(roc_auc, 4))
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
Accuracy: 0.8125
Precision: 0.4426
Recall: 0.0455
F1 Score: 0.0826
ROC AUC: 0.6652

Confusion Matrix:
 [[2573   34]
 [ 566   27]]
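Recall is very low (0.0455) at the default 0.5 cutoff: the model misses most defaulters. Since `predict_proba` is available, the decision threshold can be tuned instead of relying on `predict`. A minimal sketch on synthetic scores (the real `y_test` and `y_pred_proba` would be substituted in practice), using Youden's J statistic to pick the threshold:

```python
import numpy as np
from sklearn.metrics import roc_curve, recall_score

rng = np.random.default_rng(42)
# Synthetic stand-ins for y_test and y_pred_proba with mild class separation
y_true = rng.binomial(1, 0.2, size=2000)
scores = np.clip(rng.normal(0.3 + 0.2 * y_true, 0.15), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)
best = np.argmax(tpr - fpr)            # Youden's J = TPR - FPR
thr = thresholds[best]

recall_default = recall_score(y_true, scores >= 0.5)
recall_tuned = recall_score(y_true, scores >= thr)
print(f"threshold={thr:.3f}, recall: {recall_default:.3f} -> {recall_tuned:.3f}")
```

Other criteria, such as a profit-weighted cost of false negatives versus false positives, could replace Youden's J depending on the lender's objective.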
2.1.2 Feature Importance ¶
Using the fitted random forest model, we next identify the top 10 most important features.
imp_series = pd.Series(rf.feature_importances_, index=feat_names_all).sort_values(ascending=False)
top10_feats = imp_series.head(10).index.tolist()
print("\nTop-10 features:")
print(imp_series.head(10))
Top-10 features:
CR009    0.071677
AP001    0.059917
MB005    0.054519
TD013    0.052974
TD009    0.046844
CR019    0.041405
TD005    0.041022
TD024    0.040452
CR018    0.039471
TD014    0.035555
dtype: float64
X_train_top = X_train_enc[top10_feats]
X_test_top = X_test_enc[top10_feats]
rf_top = RandomForestClassifier(
n_estimators=500,
max_depth=6,
min_samples_split=10,
min_samples_leaf=5,
max_features=0.5,
bootstrap=True,
class_weight="balanced",
random_state=42,
n_jobs=-1
)
rf_top.fit(X_train_top, y_train)
RandomForestClassifier(class_weight='balanced', max_depth=6, max_features=0.5,
                       min_samples_leaf=5, min_samples_split=10,
                       n_estimators=500, n_jobs=-1, random_state=42)
auc_top = roc_auc_score(y_test, rf_top.predict_proba(X_test_top)[:, 1])
print(f"Top-10 RF ROC-AUC: {auc_top:.4f}")
Top-10 RF ROC-AUC: 0.6205
Section 3 SHAP ¶
import shap
explainer = shap.TreeExplainer(rf_top)
shap_values = explainer(X_test_top)
print("values shape:", shap_values.values.shape)
values shape: (3200, 10, 2)
# Note: shap_values[:, 1] slices the *feature* axis, not the class axis,
# so the result is still two-dimensional — not the positive-class slice we want.
sv_pos = shap_values[:, 1]
print("sv_pos shape:", sv_pos.values.shape)
sv_pos shape: (3200, 2)
# Correct approach: take the positive-class (index 1) slice along the last axis
sv_pos = shap.Explanation(
values = shap_values.values[:, :, 1], # (n_samples, n_features)
base_values = shap_values.base_values[:, 1], # (n_samples,)
data = X_test_top.values, # match the same rows/cols
feature_names= list(X_test_top.columns)
)
3.2.1 Bar Plot ¶
shap.plots.bar(sv_pos, max_display=10, clustering=None)
This SHAP bar plot shows the top 10 features driving the model’s predictions on average. Variables such as TD013, CR009, and TD009 have the strongest influence, each contributing around +0.03 to the prediction, while others like TD005 and MB005 also play important roles. The remaining features have smaller but still noticeable impacts, suggesting that a few key variables dominate the model’s decision-making process.
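The bar heights are simply the mean absolute SHAP value per feature, which can be reproduced from the raw values array. Below is a self-contained sketch on a synthetic (n_samples, n_features) array standing in for `sv_pos.values`:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=(100, 3))           # stand-in for sv_pos.values
feature_names = ["TD013", "CR009", "TD009"]  # stand-in feature names

# Bar-plot height for each feature = mean(|SHAP value|) over all samples
mean_abs = np.abs(values).mean(axis=0)
order = np.argsort(mean_abs)[::-1]           # sort features by importance, descending
for i in order:
    print(f"{feature_names[i]}: {mean_abs[i]:.4f}")
```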
rows = [0, 1, 2, 3]
fig, axes = plt.subplots(2, 2, figsize=(20,12))
for ax, r in zip(axes.ravel(), rows):
plt.sca(ax)
shap.plots.bar(sv_pos[r], show=False, max_display=10)
ax.set_title(f"Observation {r}")
plt.tight_layout()
plt.show()
These SHAP bar plots illustrate the feature contributions for four individual observations. In each case, CR009 consistently has the strongest negative impact on the prediction, lowering the output significantly. Other features such as TD005, MB005, and TD024 occasionally provide small positive contributions, but their influence is much weaker compared to CR009, showing that this feature dominates the model’s decision for these instances.
3.2.2 Summary Plot ¶
shap.plots.beeswarm(sv_pos, max_display=10)
The SHAP summary plot shows how the top features influence the model’s predictions. Features like TD013, CR009, and TD009 have the strongest impact, with both positive and negative SHAP values, meaning they can either increase or decrease the prediction depending on their values. The color gradient shows that higher feature values (red) tend to push the prediction upward, while lower values (blue) generally push it downward, highlighting the directional effect of each feature.
3.2.3 Waterfall Plot ¶
3.2.3.1 Observation 1 ¶
# Observation 1
shap.plots.waterfall(sv_pos[0],max_display=10)
The waterfall plot of the first observation shows how individual features contributed to lowering the model output to 0.432, below the base value of 0.5. The largest negative impact came from CR009 (-0.08), followed by smaller negative contributions from CR019 and CR018. In contrast, features like TD024, TD005, TD009, and MB005 had small positive effects, slightly offsetting the drop but not enough to outweigh the strong downward pull from CR009.
3.2.3.2 Observation 2 ¶
# Observation 2
shap.plots.waterfall(sv_pos[1],max_display=10)
The waterfall plot of the second observation shows that the model prediction of 0.308 is well below the base value of 0.5. The largest negative drivers were TD013 (-0.09), TD009 (-0.08), and TD005 (-0.04), which strongly pulled the prediction downward. On the other hand, MB005 (+0.02) and smaller contributions from CR009 and CR018 (+0.01 each) slightly increased the score, but their impact was not enough to counteract the strong negative effects.
3.2.3.3 Observation 3 ¶
# Observation 3
shap.plots.waterfall(sv_pos[2],max_display=10)
This SHAP waterfall plot shows that the model predicted 0.445, slightly below the base value of 0.5. The main negative driver was CR009 (-0.06), which significantly lowered the score. In contrast, TD024 (+0.01) and MB005 (+0.01) provided small positive contributions, but these were outweighed by additional negative effects from CR019 (-0.01) and TD005 (-0.01), keeping the final prediction below the baseline.
3.2.3.4 Observation 4 ¶
# Observation 4
shap.plots.waterfall(sv_pos[3],max_display=10)
This SHAP waterfall plot shows that the model predicted 0.427, below the baseline of 0.5. The largest negative impact came from CR009 (-0.08), which strongly reduced the prediction. While TD005, MB005, and TD024 provided small positive contributions (+0.01 each), they were outweighed by CR009 together with additional small negative effects such as CR019 (-0.01), leading to an overall lower prediction.
3.2.4 Force Plot ¶
3.2.4.1 Observation 1 ¶
row = 0
shap.initjs()
shap.force_plot(
base_value = sv_pos.base_values[row],
shap_values = sv_pos.values[row, :],
features = X_test_top.iloc[row, :],
feature_names = sv_pos.feature_names
)
This SHAP force plot shows that the model’s prediction is 0.43, lower than the baseline of ~0.50. Features MB005, TD009, TD005, and TD024 pushed the prediction upward, but their influence was outweighed by strong downward contributions from CR009, CR019, and CR018, with CR009 having the largest negative effect. Overall, the negative impacts dominated, lowering the final prediction below the baseline.
3.2.4.2 Observation 2 ¶
row = 1
shap.initjs()
shap.force_plot(
base_value = sv_pos.base_values[row],
shap_values = sv_pos.values[row, :],
features = X_test_top.iloc[row, :],
feature_names = sv_pos.feature_names
)
This SHAP force plot shows the model’s prediction of 0.31, which is significantly lower than the baseline of ~0.50. While CR009 and MB005 provided small upward pushes, their effect was minimal compared to the strong downward contributions from TD013, TD009, TD005, TD014, and TD024. These negative influences collectively drove the prediction well below the baseline.
3.2.4.3 Observation 3 ¶
row = 2
shap.initjs()
shap.force_plot(
base_value = sv_pos.base_values[row],
shap_values = sv_pos.values[row, :],
features = X_test_top.iloc[row, :],
feature_names = sv_pos.feature_names
)
This SHAP force plot shows a prediction of 0.45, slightly below the baseline of ~0.50. Features MB005 and TD024 increased the prediction (pushing it higher), but their influence was outweighed by strong downward contributions from CR009, CR019, TD005, and AP001, which collectively pulled the score lower. This balance of effects explains why the final prediction remained below the baseline.
3.2.4.4 Observation 4 ¶
row = 3
shap.initjs()
shap.force_plot(
base_value = sv_pos.base_values[row],
shap_values = sv_pos.values[row, :],
features = X_test_top.iloc[row, :],
feature_names = sv_pos.feature_names
)
This SHAP force plot shows a prediction of 0.43, which is below the baseline of ~0.50. Features like TD013, MB005, and TD005 pushed the prediction higher, but their influence was outweighed by strong downward contributions from CR009 and CR019, which significantly reduced the score. As a result, the final prediction leaned toward a lower risk outcome.
3.2.5 Dependence Plot ¶
top_idx = np.argsort(np.abs(sv_pos.values).mean(0))[::-1][:10]
top_feats = [sv_pos.feature_names[i] for i in top_idx]
shap.plots.scatter(sv_pos[:, top_feats[0]])
shap.plots.scatter(sv_pos[:, top_feats[0]], color=sv_pos[:, top_feats[1]])
These dependence plots show how TD013 influences the model prediction.
- In the first graph, as TD013 increases, its SHAP value rises sharply from negative to strongly positive, peaking around 15–20 before flattening, meaning higher TD013 values increase the likelihood of a positive prediction.
- In the second graph, we see the same trend but colored by CR009. Darker red points (higher CR009 values) are more concentrated where SHAP values are positive, suggesting that high TD013 combined with high CR009 amplifies the positive impact on the model’s output.
3.2.6 Heatmap ¶
shap.plots.heatmap(sv_pos[:, top_feats], instance_order=shap.Explanation.abs.mean(1))
This summary plot shows the variation of SHAP values across all features and instances, indicating how each feature impacts the model’s predictions. TD013, CR009, and TD009 stand out with strong positive (red) and negative (blue) contributions, suggesting they are the most influential drivers of the model output. The alternating bands of red and blue highlight how the same feature can either push the prediction higher or lower depending on its value. In contrast, features like AP001 and CR019 show lighter and more neutral effects, meaning they have a relatively smaller influence on predictions.
3.2.7 Decision Plot Comparison ¶
rows = [0, 1, 2, 3] # choose any 4 rows you like
fig, axes = plt.subplots(2, 2, figsize=(14, 8))
axes = axes.ravel()
for ax, r in zip(axes, rows):
plt.sca(ax)
shap.decision_plot(
base_value = sv_pos.base_values[r],
shap_values = sv_pos.values[r],
features = X_test_top.iloc[r, :],
feature_names = sv_pos.feature_names,
show=False
)
ax.set_title(f"Observation {r}")
plt.tight_layout()
plt.show()
These decision plots illustrate how different features contribute to the model predictions for four individual observations. For Observations 0, 2, and 3, CR009 has the strongest negative impact, pulling the prediction downward, while smaller contributions from features like TD024, MB005, and CR019 slightly offset this effect. Observation 1, on the other hand, is mainly influenced by TD013 and TD009, both driving the prediction lower. Overall, the plots highlight that CR009 consistently dominates the prediction direction, while other features contribute in smaller, observation-specific ways.
3.3 Feature Importance Under SHAP ¶
from scipy.special import softmax
def print_feature_importances_shap_values(shap_values, features):
'''
Prints the feature importances based on SHAP values in an ordered way
shap_values -> The SHAP values calculated from a shap.Explainer object
features -> The name of the features, on the order presented to the explainer
'''
importances = []
for i in range(shap_values.values.shape[1]):
importances.append(np.mean(np.abs(shap_values.values[:, i])))
importances_norm = softmax(importances)
feature_importances = {fea: imp for imp, fea in zip(importances, features)}
feature_importances_norm = {fea: imp for imp, fea in zip(importances_norm, features)}
feature_importances = {k: v for k, v in sorted(feature_importances.items(), key = lambda item: item[1], reverse = True)}
feature_importances_norm = {k: v for k, v in sorted(feature_importances_norm.items(), key = lambda item: item[1], reverse = True)}
for k, v in feature_importances.items():
print(f"{k} -> {v:.4f} (softmax = {feature_importances_norm[k]:,.4f})")
# Use the positive-class slice so the importances refer to the default class
print_feature_importances_shap_values(sv_pos, top10_feats)
TD013 -> 0.0319 (softmax = 0.1017)
CR009 -> 0.0269 (softmax = 0.1012)
TD009 -> 0.0253 (softmax = 0.1010)
TD005 -> 0.0176 (softmax = 0.1002)
MB005 -> 0.0160 (softmax = 0.1001)
TD014 -> 0.0090 (softmax = 0.0994)
TD024 -> 0.0069 (softmax = 0.0992)
CR018 -> 0.0068 (softmax = 0.0992)
CR019 -> 0.0058 (softmax = 0.0991)
AP001 -> 0.0050 (softmax = 0.0990)
The feature importance results show that TD013, CR009, and TD009 are the top three contributors, each with similar SHAP-based importance scores (~0.031–0.025) and softmax weights around 0.10. These features dominate the model’s decision-making, while the remaining features have much smaller influences, indicating they play only marginal roles in shaping predictions. This suggests the model’s behavior is strongly driven by a small set of key variables.
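The softmax weights are nearly uniform (about 0.10 each) because the raw importances are small and close together, and softmax compresses small gaps. A quick check using the reported raw values:

```python
import numpy as np
from scipy.special import softmax

# Raw mean-|SHAP| importances as printed above, in the same order
raw = np.array([0.0319, 0.0269, 0.0253, 0.0176, 0.0160,
                0.0090, 0.0069, 0.0068, 0.0058, 0.0050])
w = softmax(raw)
print(np.round(w, 4))  # all near 0.10 — softmax flattens small differences
print(w.sum())         # softmax weights always sum to 1
```

Because of this flattening, the raw importances are the better quantity to compare; the softmax column mainly provides a normalized view.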
Section 4 Summary & Conclusions ¶
Summary
Our SHAP analysis highlights that the model’s predictions are heavily influenced by a small set of features, with TD013, CR009, and TD009 emerging as the most impactful drivers. Dependence and force plots show that increases in TD013 tend to push predictions upward, while CR009 generally exerts a strong negative pull. Interaction effects between these variables amplify their importance, as seen when high values of TD013 align with high CR009. Other features, such as TD005 and MB005, contribute moderately, while the remaining predictors play only marginal roles.
Conclusions
The findings suggest that lender risk assessment should focus closely on a handful of dominant indicators, particularly TD013, CR009, and TD009, as they shape most of the model’s predictive power. This concentration of influence implies that improving data quality and monitoring around these key variables could substantially enhance fraud detection and default prediction. At the same time, the relatively small impact of the other features indicates diminishing returns in expanding feature sets without addressing these core drivers. Overall, SHAP analysis not only confirms which variables matter most but also provides transparency into how they push individual predictions higher or lower—helping lenders adopt more targeted and interpretable risk management strategies.